libxl/xl: improve behaviour when guest fails to suspend itself.
The PV suspend protocol requires guest co-operation: the guest must
respond to a suspend request written to the xenstore control node by
clearing the node and then making a suspend hypercall.
Currently when a guest fails to do this libxl times out and returns
a generic failure code to the caller.
In response to this failure xl attempts to resume the guest. However
if the guest has not responded to the suspend request then there is no
guarantee that the guest has made the suspend hypercall (in fact it is
quite unlikely). Since the resume process attempts to modify the
return value of the hypercall (to indicate a cancelled suspend) this
results in the guest eax/rax register being corrupted!
To fix this, change libxl to do the following:
* Wait for the guest to acknowledge the suspend request.
  - on timeout cancel the suspend request.
    - if cancellation is successful then return a new error code to
      indicate that the guest is not responding.
    - if the cancel does not succeed then we raced with the guest
      which actually did acknowledge at the last minute, so
      continue.
* Wait for the guest to suspend.
  - on timeout return the standard error code as before.
* Guest successfully suspended, return success.
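The sequence above can be sketched as a small self-contained model.
This is an illustration only, not the libxl implementation: the error
code names, the mock_guest structure and the tick-based timeouts are
all hypothetical, and the real code talks to xenstore rather than to
an in-memory struct.

```c
#include <stdbool.h>

/* Hypothetical result codes; the real libxl names may differ. */
enum suspend_result {
    SUSPEND_OK = 0,
    SUSPEND_ERR_FAIL = -1,               /* standard failure, as before */
    SUSPEND_ERR_GUEST_UNRESPONSIVE = -2  /* new: request never acknowledged */
};

/* Toy guest: the tick at which it acts, or -1 if it never does. */
struct mock_guest {
    int ack_tick;      /* guest clears the control node */
    int suspend_tick;  /* guest makes the suspend hypercall */
};

static bool guest_has_acked(const struct mock_guest *g, int now)
{
    return g->ack_tick >= 0 && g->ack_tick <= now;
}

static bool guest_has_suspended(const struct mock_guest *g, int now)
{
    return g->suspend_tick >= 0 && g->suspend_tick <= now;
}

/* Cancelling only succeeds while the request is still pending,
 * i.e. while the guest has not yet cleared the control node. */
static bool cancel_suspend_request(const struct mock_guest *g, int now)
{
    return !guest_has_acked(g, now);
}

static enum suspend_result suspend_guest(const struct mock_guest *g,
                                         int ack_timeout,
                                         int suspend_timeout)
{
    int now, i;
    bool acked = false;

    /* Step 1: wait for the guest to acknowledge the request. */
    for (now = 0; now < ack_timeout; now++) {
        if (guest_has_acked(g, now)) {
            acked = true;
            break;
        }
    }

    if (!acked) {
        /* Timed out: try to withdraw the request. */
        if (cancel_suspend_request(g, now))
            return SUSPEND_ERR_GUEST_UNRESPONSIVE;
        /* Cancel failed: the guest acknowledged at the last
         * minute, so carry on as if it had responded in time. */
    }

    /* Step 2: wait for the guest to actually suspend. */
    for (i = 0; i < suspend_timeout; i++, now++) {
        if (guest_has_suspended(g, now))
            return SUSPEND_OK;
    }

    return SUSPEND_ERR_FAIL;
}
```

In this model a guest with ack_tick == ack_timeout exercises the race:
the polling loop misses the acknowledgement, the cancel then fails,
and the function goes on to wait for the suspend.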
Lastly, in xl, do not attempt to resume a guest if it has not
responded to the suspend request.
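A minimal sketch of that caller-side decision, assuming the caller can
tell the two failure cases apart by return code. The enum and function
names here are hypothetical and are not taken from xl:

```c
#include <stdbool.h>

/* Hypothetical outcomes of a suspend attempt as seen by the caller. */
enum migrate_suspend_rc {
    MIGRATE_SUSPENDED,           /* guest suspended successfully */
    MIGRATE_SUSPEND_FAILED,      /* guest acknowledged, then failed */
    MIGRATE_GUEST_UNRESPONSIVE   /* guest never acknowledged */
};

/* Decide whether to attempt a resume after a suspend attempt.
 * An unresponsive guest has most likely not made the suspend
 * hypercall, so rewriting the hypercall return value would
 * corrupt its eax/rax; leave such a guest alone. */
static bool should_attempt_resume(enum migrate_suspend_rc rc)
{
    switch (rc) {
    case MIGRATE_SUSPEND_FAILED:
        return true;
    case MIGRATE_GUEST_UNRESPONSIVE:
    case MIGRATE_SUSPENDED:
    default:
        return false;
    }
}
```

The success case returns false only because, in this sketch, there is
no failed suspend to undo at that point.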
Tested by live migration of PVops kernels which ignore the suspend
request, which have already crashed, and which suspend/resume
correctly. In the first two cases the source domain is left alone (and
continues to function in the first case), and in the third the
migration is successful.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>